69 research outputs found

RLZAP: Relative Lempel-Ziv with Adaptive Pointers

Author: A Farruggia
C Boucher
C Hoobin
D Belazzougui
H Ferrada
J Ziv
J Ziv
M Léonard
P Ferragina
R Raman
S Deorowicz
S Deorowicz
S Kuruppu
Publication date: 01/01/2016

Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. With Kuruppu et al.'s (SPIRE 2010) original implementation, a reference genome is selected and then the other genomes are greedily parsed into phrases exactly matching substrings of the reference. Deorowicz and Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with a mismatch character usually gives better compression because many of the differences between individuals' genomes are single-nucleotide substitutions. Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers and run-length compressing them usually gives even better compression. In this paper we generalize Ferrada et al.'s idea to also handle short insertions, deletions and multi-character substitutions well. We show experimentally that our generalization achieves better compression than Ferrada et al.'s implementation with comparable random-access times.
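
The greedy parsing with mismatch characters described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `rlz_parse` and `rlz_decode`, the phrase layout `(ref_pos, match_len, mismatch_char)`, and the naive quadratic search are all assumptions made for illustration (practical implementations typically use a suffix array or FM-index over the reference).

```python
def rlz_parse(reference: str, target: str) -> list[tuple[int, int, str]]:
    """Greedily parse `target` into (ref_pos, match_len, mismatch_char) phrases.

    Hypothetical helper: each phrase copies the longest reference substring
    matching the remaining target, then stores one explicit character.
    """
    phrases = []
    i = 0
    while i < len(target):
        best_pos, best_len = 0, 0
        # Naive scan for the longest match of target[i:] anywhere in the reference.
        for pos in range(len(reference)):
            length = 0
            while (pos + length < len(reference)
                   and i + length < len(target)
                   and reference[pos + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        # End the phrase with the next target character (empty at end of input).
        nxt = i + best_len
        mismatch = target[nxt] if nxt < len(target) else ""
        phrases.append((best_pos, best_len, mismatch))
        i = nxt + 1
    return phrases


def rlz_decode(reference: str, phrases: list[tuple[int, int, str]]) -> str:
    """Rebuild the target by copying from the reference and appending mismatches."""
    return "".join(reference[pos:pos + length] + ch for pos, length, ch in phrases)


reference = "ACGTACGTTGCA"
target = "ACGAACGTTGCC"  # differs from the reference by two substitutions
phrases = rlz_parse(reference, target)
assert rlz_decode(reference, phrases) == target
```

In this toy example each substitution ends one phrase. A relative pointer in the sense of the abstract would store the reference position minus the phrase's starting position in the target; here both phrases start at the same offset in the target as in the reference, so both relative pointers would be 0, which is the kind of repetition run-length compression exploits.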

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa

Even faster sorting of (not only) integers

Author: C Hoare
D Knuth
D Musser
D Shell
J Shen
J Williams
M Codish
M Kokot
PM McIlroy
S Deorowicz
T Cormen
Publication date: 02/03/2017

In this paper we introduce RADULS2, the fastest parallel sorter based on the radix algorithm. It is optimized to process huge amounts of data making use of modern multicore CPUs. The main novelties include an extremely optimized algorithm for handling tiny arrays (up to about a hundred records), which can appear billions of times as subproblems, and improved processing of larger subarrays with better use of non-temporal memory stores.

arXiv.org e-Print Archive

Crossref

Improving Transmission Efficiency of Large Sequence Alignment/Map (SAM) Files

Author: C Kozanitis
C Wang
Chin-Tser Huang
H Li
Jijun Tang
Leonardo Mariño-Ramírez
Muhammad Nazmus Sakib
S Deorowicz
W. Jim Zheng
Publication venue: Public Library of Science
Publication date: 01/01/2011

Research in bioinformatics primarily involves the collection and analysis of a large volume of genomic data. Naturally, it demands efficient storage and transfer of this huge amount of data. In recent years, some research has been done to find efficient compression algorithms to reduce the size of various sequencing data. One way to reduce the transmission time of large files is to compress them losslessly as much as possible. In this paper, we present SAMZIP, a specialized encoding scheme for sequence alignment data in SAM (Sequence Alignment/Map) format, which improves on the compression ratio of existing compression tools. To achieve this, we exploit prior knowledge of the file format and its specification. Our experimental results show that our encoding scheme improves the compression ratio, thereby reducing overall transmission time significantly.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Predicting the deleterious effects of mutation load in fragmented populations

Author: A. Apostolico
A. Moffat
A.S. Fraenkel
D.A. Huffman
D.A. Lelewer
G. Manzini
G. Manzini
I.H. Witten
J.L. Bentley
P. Elias
P. Ferragina
S. Deorowicz
Publication date: 01/01/2006

Human-induced habitat fragmentation constitutes a major threat to biodiversity. Both genetic and demographic factors combine to drive small and isolated populations into extinction vortices. Nevertheless, the deleterious effects of inbreeding and drift load may depend on population structure, migration patterns, and mating systems and are difficult to predict in the absence of crossing experiments. We performed stochastic individual-based simulations aimed at predicting the effects of deleterious mutations on population fitness (offspring viability and median time to extinction) under a variety of settings (landscape configurations, migration models, and mating systems) on the basis of easy-to-collect demographic and genetic information. Pooling all simulations, a large part (70%) of variance in offspring viability was explained by a combination of genetic structure (F(ST)) and within-deme heterozygosity (H(S)). A similar part of variance in median time to extinction was explained by a combination of local population size (N) and heterozygosity (H(S)). In both cases the predictive power increased above 80% when information on mating systems was available. These results provide robust predictive models to evaluate the viability prospects of fragmented populations.

Crossref

Serveur académique lausannois

ProdInra

Relative Lempel-Ziv Compression of Suffix Arrays

Author: A Farruggia
C Hoobin
D Belazzougui
J Ziv
M Cáceres
NJ Larsson
P Ferragina
R González
S Deorowicz
S Kuruppu
T Gagie
T Gagie
U Manber
V Mäkinen
Publication venue: Springer Science and Business Media Deutschland GmbH
Publication date: 01/01/2020

We show that a combination of differential encoding, random sampling, and relative Lempel-Ziv (RLZ) parsing is effective for compressing suffix arrays, while simultaneously allowing very fast decompression of arbitrary suffix array intervals, facilitating pattern matching. The resulting text index, while somewhat larger (5-10x) than the recent r-index of Gagie, Navarro, and Prezza (Proc. SODA ’18), still provides significant compression and allows pattern location queries to be answered more than two orders of magnitude faster in practice.

Crossref

Helsingin yliopiston digitaalinen arkisto

State-of-the-Art in Weighted Finite-State Spell-Checking

Author: A. Savary
A. Wilcox-O’Hearn
C.E. Shannon
E. Brill
E. Mays
F.J. Damerau
J. Otero
J. Raviv
K. Schulz
K. Kukich
K. Oflazer
K.W. Church
S. Deorowicz
Publication venue: Springer-Verlag
Publication date: 01/01/2014

Proceeding volume: 2

The following claims can be made about finite-state methods for spell-checking: 1) Finite-state language models provide support for morphologically complex languages that word lists, affix stripping and similar approaches do not provide; 2) Weighted finite-state models have expressive power equal to other, state-of-the-art string algorithms used by contemporary spell-checkers; and 3) Finite-state models are at least as fast as other string algorithms for lookup and error correction. In this article, we use some contemporary non-finite-state spell-checking methods as a baseline and perform tests in light of the claims, to evaluate state-of-the-art finite-state spell-checking methods. We verify that finite-state spell-checking systems outperform the traditional approaches for English. We also show that the models for morphologically complex languages can be made to perform on par with English systems.

CiteSeerX

Crossref

Helsingin yliopiston digitaalinen arkisto

LW-FQZip 2: a parallelized reference-based compression of FASTQ files

Author: BA Flusberg
C Kozanitis
D Branton
DC Jones
EL Dijk van
F Hach
F Hach
G Benoit
H Li
I Numanagic
JK Bonfield
L Janin
L Roguski
M Hosseini
M Howison
M Nicolae
MH-Y Fritz
Qingjin Deng
R Patro
R Rozov
S Deorowicz
S Grabowski
SD Kahn
W Tembe
Y Zhang
Ying Chu
Yiwen Sun
Z Zhu
Z Zhu
Zexuan Zhu
Zhenkun Wen
Zhi-An Huang
Publication venue: Springer Science and Business Media LLC

Crossref

Sequencing and de novo assembly of 150 genomes from Denmark as a population reference

Author: A Helgason
A Kong
A Telenti
AD Børglum
Ali Syed
Anders D. Børglum
Anders E. Halager
Anders Krogh
Bent Petersen
BJ Stucky
Chen Ye
Christian N. S. Pedersen
Christian Theil Have
Christina M. Hultman
David Westergaard
DF Gudbjartsson
Esben Flindt
Francesco Lescai
G Lunter
GA Van der Auwera
GD Poznik
GM Cooper
H Cao
H Eiberg
H Kupfermann
H Li
H Li
H Li
Hans Eiberg
Hongzhi Cao
J Huddleston
Jacob Malte Jensen
Jakob Grove
Jette Bork-Jensen
Jihua Sun
Johan van Beusekom
Jonas Andreas Sibbesen
Jose M. G. Izarzugaza
JS Seo
JT Simpson
Jun Wang
Junhua Rao
K Katoh
K Tamura
Karsten Kristiansen
Kirstine Belling
KM Steinberg
L Paternoster
Lars Bolund
Lasse Maretty
Laurits Skov
LC Francioli
M Lek
M Nothnagel
M Oven
M Pendleton
MA Eberle
Maria Luisa Matey-Hernandez
Marie Grosjean
MC Frith
Mikkel Heide Schierup
MR Hoehe
Ning Li
Ole Lund
Ole Mors
Oluf Pedersen
P Rice
Palle Villesen
Patrick Sullivan
Peter Løngren
PH Sudmant
PL Auer
R Hubley
R Luo
Rachita Yadav
Ramneek Gupta
Ruiqi Xu
Rune M. Friborg
S Besenbacher
S Deorowicz
S Gnerre
S Liu
S Ripke
SF Altschul
Shengting Li
Shujia Huang
Simon Rasmussen
Siyang Liu
SM Kiełbasa
Stephanie Le Hellard
Søren Besenbacher
Søren Brunak
T Espeseth
T Magoč
Thomas D. Als
Thomas Espeseth
Thomas Mailund
Thomas Sicheritz-Pontén
Thorkild I. A. Sørensen
Torben Hansen
VA Schneider
Weijian Ye
WP Kloosterman
WS Wong
Xiaosen Guo
Xun Xu
Yuqi Chang
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2017

Hundreds of thousands of human genomes are now being sequenced to characterize genetic variation and use this information to augment association mapping studies of complex disorders and other phenotypic traits. Genetic variation is identified mainly by mapping short reads to the reference genome or by performing local assembly. However, these approaches are biased against discovery of structural variants and variation in the more complex parts of the genome. Hence, large-scale de novo assembly is needed. Here we show that it is possible to construct excellent de novo assemblies from high-coverage sequencing with mate-pair libraries extending up to 20 kilobases. We report de novo assemblies of 150 individuals (50 trios) from the GenomeDenmark project. The quality of these assemblies is similar to those obtained using the more expensive long-read technology. We use the assemblies to identify a rich set of structural variants including many novel insertions and demonstrate how this variant catalogue enables further deciphering of known association mapping signals. We leverage the assemblies to provide 100 completely resolved major histocompatibility complex haplotypes and to resolve major parts of the Y chromosome. Our study provides a regional reference genome that we expect will improve the power of future association mapping studies and hence pave the way for precision medicine initiatives, which are now being launched in many countries including Denmark.

Crossref

Copenhagen University Research Information System

Carolina Digital Repository

Online Research Database In Technology

Reference-free SNP detection: dealing with the data deluge

Author: AR Quinlan
D Zerbino
Dan MacLean
E Quillery
J Butler
J Catchen
JF Nijkamp
JT Simpson
KJV Nordström
MR Miller
NA Baird
P Medvedev
P Peterlongo
PA Hohenlohe
PA Pevzner
R Li
Richard M Leggett
RM Leggett
S Deorowicz
Y Jiang
Z Iqbal
Publication venue: Springer Science and Business Media LLC

Crossref

Computational solutions for omics data

Author: A Butte
A Chatr-aryamontri
A Franceschini
A Joshi
A Lan
A Mortazavi
A Subramanian
A Tanay
AC Jungkamp
AJ Pinho
AK Wong
AR Whitney
B Langmead
B Langmead
B Paten
Bonnie Berger
BP Kelley
C Huttenhower
C Kingsford
C Trapnell
C Trapnell
C Trapnell
C Wang
CH Yeang
CJ Vaske
CS Liao
D Croft
D Earl
D Kim
D Kim
D Park
DB Allison
DB Jaffe
DR Zerbino
E Banks
E Banks
E Cerami
E Nabieva
E Segal
E Yeger-Lotem
EJ Rossin
ER Mardis
ES Lander
ET Wang
F Hach
F Hach
F Markowetz
F Ozsolak
F Vandin
F Vandin
F Vezzi
GE Zinman
H Li
H Li
I Ulitsky
I Ulitsky
IA Adzhubei
J Butler
J Clarke
J Flannick
J Goecks
J Lamb
J Pandey
JC Marioni
JC Venter
Jian Peng
JT Dudley
JT Leek
JT Simpson
JT Simpson
K Rhrissorrakrai
KI Goh
KY Yeung
L Parts
LD Stein
LH Hartwell
LM Heiser
LR Meyer
M Ascano
M Burrows
M Garber
M Gross
M Gstaiger
M Hafner
M Hsi-Yang Fritz
M Kircher
M Koyuturk
M Narayanan
M Reich
M Schatz
M Schmid
M Sirota
M Steffen
M Yandell
MB Gerstein
MB Gerstein
MC Brandon
MC Schatz
MG Grabherr
MH Maathuis
ML Metzker
Mona Singh
N Atias
N de Souza
N Tuncbag
NP Palmer
NT Ingolia
O Hirose
O Litvin
O Ogasawara
O Stegle
O Vanunu
P Ferragina
P Flicek
P Jiang
P Kumar
P Lu
P Shannon
PA Pevzner
PE Compeau
PG Doyle
PO Brown
PR Loh
PR Schmid
R Colak
R Gaujoux
R Li
R Li
R Li
R Singh
RC Gentleman
S Anders
S Batzoglou
S Christley
S Deorowicz
S Erten
S Kohler
S Levy
S Navlakha
S Ng
S Suthram
SA Chowdhury
SD Kahn
SF Altschul
SG Tringe
SL Salzberg
SS Huang
SS Shen-Orr
T Barrett
T Ideker
T Michoel
TS Furey
U Manber
UD Akavia
W Ali
W Li
W Tembe
WJ Kent
X Liu
X Wang
X Zhou
Y Prat
Y Wang
Y Zhang
YA Kim
Z Tu
Z Wang
Publication venue: Springer Science and Business Media LLC
Publication date: 01/04/2013

High-throughput experimental technologies are generating increasingly massive and complex genomic data sets. The sheer enormity and heterogeneity of these data threaten to make the arising problems computationally infeasible. Fortunately, powerful algorithmic techniques lead to software that can answer important biomedical questions in practice. In this Review, we sample the algorithmic landscape, focusing on state-of-the-art techniques, the understanding of which will aid the bench biologist in analysing omics data. We spotlight specific examples that have facilitated and enriched analyses of sequence, transcriptomic and network data sets.

National Institutes of Health (U.S.) (Grant GM081871)

Princeton University Open Access Repository

DSpace@MIT

Crossref

PubMed Central